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Abstract 

Some thoughts on large-scale computer-aided statistical mathematics (primarily 
simulation) which were presented at the 6th Annual Conference on the Computer 
Science/Statistics Interface conference are presented. Comments of participants 
and panelists (D. F. Andrews, J. N. Arvesen, D. P. Gaver, and G. Marsaglia) have 
been added to the original text. 



1. INTRODUCTION 

The aim of this paper is to stimulate discussion 
on large-scale computer-aided statistical mathe- 
matics (primarily simulation) at this conference. 
There has been a lot of discussion of computers 
and statistics (Hartley, 1972; Milton and Nelder , 
1969; Chambers, 1970), but little of large-scale 
use of computers in simulation experiments to 
solve open distributional problems. This has per- 
haps been because of the unavailability of large 
computers and large amounts of computer time to 
research workers and statisticians. I think this 
will change rapidly over the next ten years as 
internal computation speed and the size of random 
access memories go up. The talk given by Dr. A. 

G. Anderson at this conference has amply illus- 
trated this trend. 

To take advantage of this availability, and to use 
the internal time sharing of central processing 
units inherent in multiprogramming, new statistical 



techniques which are computation-oriented will have 
to be developed. There is already growing impetus 
in this direction and this new technology coupled 
to the computers will make an enormous impact on 
statistics. I should note, too, that large-scale 
simulations are commonplace in industry and devel- 
opment laboratories, but the inefficiency of most 
of these computations is appalling. 

There are recent surveys of several aspects of 
statistical computing (Hemmerle, 1967; Halton, 

1970; Chambers, 1970; Freiberger and Grenander, 
1971), most notably that by Tukey (1972b) who has 
been responsible for many of the new ideas in sta- 
tistical computation. Consequently, I will only 
describe here the evolution of a computer program 
called COMPSTAT which was developed to try to use 
the IBM 360/91 computer at the IBM Research Center 
as efficiently as possible. The problems encoun- 
tered in developing this program, some solved but 
others open, are more than enough for one paper. 



*Sponsored by the Office of Naval Research through Contract NR 042-288 and the 
Foundation Research Program at the Naval Postgraduate School. 
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My interest in the problem of large-scale 
statistical computation grew from the frustration 
of trying to deal with non-normal time series, in 
particular, point processes (Cox and Lewis, 1966), 
and of having to write a book around the many gaps 
in the distribution theory. The first problem I 
tackled was the distribution of product-moment 
statistics (Lewis and Goodman, 1970) , since one 
can, in principle, find recursion relationships 
to generate the distribution for successive sample 
sizes n. It took six months to verify the 
mathematics, six months to program it, and even 
then I wasn’t sure enough of the programming to 
publish the results. I then turned to simulation, 
and quickly ran into several equally frustrating 
problems : 

(1) Many procedures, notably variance re- 
duction techniques, were very particular 
to the problem at hand and difficult to 
generalize. For example, a technique 
which works in estimating the mean of a 
distribution may not work when one is 
also interested in estimating the var- 
iance or higher moments. 

(2) Most published statistical estimation 
(point and interval) techniques were 
"valid" asymptotically, and were pro- 
hibitively expensive in terms of number 
of operations (addition and multipli- 
cation) and memory cells required. 

(3) Most "canned" routines were slow and 
generally unreliable. 

(4) "Tooling up" took an excessive amount 
of time, and storing results, tabu- 
lating results and manipulating results 
was difficult. 

It was therefore decided to look into the proce- 
dures and algorithms available, program them 
efficiently if they were useful, develop new 
techniques which were fast and economical of 
storage where necessary, and put them into a 
standard program which could be used for large 
scale simulations. 



Several guidelines were set: 

a) All procedures were to be computationally 
simple, use as little memory as possible 
and to be as broadly applicable as pos- 
sible. In particular, this meant they 
should use as little information as 
possible about the statistic, say S, to 
be simulated. For example, one might not 
want to use the specific information that 
S was positive. 

b) To utilize the speed of the computers, 
the best way seemed to be to compute the 
distributions of as many statistics as 
possible simultaneously. 

c) Memory requirements should be kept fixed 
and relatively small in order to use 
excess CPU (Central Processing Unit) 
time by running in a lowest priority par- 
tition in a multiprogrammed invironment. 

A block diagram of the overall program, COMPSTAT, 
which was developed is shown in Figure 1. We dis- 
cuss this program generally before going into 
details of implementation and unsolved problems 
in later sections. 

Referring to Figure 1, the symbol n is used to 
refer to sample size in statistical simulations, 
so that the statistic S might be the average 
of the observations in a random sample of size n. 
Any value of n may be used in the program, 
though it is written so as to repeat simulations 
on successive values of n if required. A number 
m of replications is specified by the user, with 
the option of splitting m into r blocks of 
size m' each (m = rra’). This is done to obtain 
estimates of the variance of estimates and also 
to allow for checkpoints to be taken. 

On each replication the STATISTICS GENERATOR can 
call for Z ^ n random numbers, unsorted or 
sorted by magnitude. The user writes the STA- 
TISTICS GENERATOR, specifying up to 32 statistics 
(functions of the Z random variates) . This 
has proved to be a very flexible arrangement; the 
statistics could be, for example, 
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FLOWGRAPH OF COMPSTAT PROGRAM 
Figure I. 



a) the sample serial correlations of lags 1 
to 32 in a series of random variables 
of length n; 

b) the waiting times of 32 successive cus- 
tomers in a simulated queue; 

c) an estimate of a parameter in a distribu- 
tion, the jackknifed estimate of the 
parameter, the jackknifed variance, and 
the pseudo-values; 

d) 32 points in the simulated spectrum of a 
time series of length n. 



There are many other possibilities. For each of 
these statistics the user can specify that he wants 
estimates of the first four moments of S, 16 
quantiles of the distribution of S, and 16 per- 
centiles of S (or any combination of these three). 
Quantiles here is used to mean the solution 
of the equation a = F^(x^) , where a is^ gi ven an( ^ 
F g (x) is the distribution of S. A percentile is 
jus t F(x) for given x. We assume the quantile 
exis ts . 
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Mean 



Stand. Dev 



Lower Quantiles 





7 


Hr 


x o. 001 


X 0, 002 


X 0. 005 


x o. 01 0 


X 0. 020 


X 0. 025 


X 0. 050 


x o. 1 00 


Normal 
( Exact) 


11. 982 


2. 562 


6. 989 


7. 197 


7. 512 


7. 789 


8. 11 3 


8. 229 


8, 642 


9. 165 


Exponential 


1 1. 824 
(0. 001 ) 


2. 827 
(0 001 ) 


6. 133 
(0, 005) 


6, 378 
(0. 003) 


6.746 
{0. 003) 


7. 068 
(0, 002) 


7, 44 5 
(0, 002) 


7. 580 
(0. 002) 


8. 063 
(0. 002) 


8 668 
(0, 001) 


l/2 Weibull 


213, 828 
(0. Oil) 


85. 678 
(0. 033) 


67. 580 
(0. 067) 


72. 483 
(0. 080) 


80. 052 
(0. 082) 


87, 068 
(0. 059) 


95. 410 
(0. 032) 


98. 523 
(0. 030) 


1 09. 707 
(0. 036) 


124. 384 
(0. 032) 




Skewness 


Kurtosis 






Upper Quantiles 












*1 




X 0. 900 


X 0. 950 


X 0, 97 5 


X 0. 980 


X 0. 990 


X 0. 995 


X 0. 998 


X 0. 999 


Normal 
( Exact) 


1. 442 




1 5. 324 


16. 764 


18. 176 


18. 627 


20. 024 


21 . 41 5 


23. 251 


24. 638 


Exponential 


1. 061 
(0. 003) 


2. 120 
(0. 023) 


1 5. 51 4 
(0. 004) 


17. 064 
(0. 005) 


18. 572 
(0. 005) 


19. 061 
(0. 006) 


20. 552 
(0. 009) 


22. 04 5 
(0. 009) 


24. 055 
(0, 024) 


25. 557 
(0, 033 


1/2 Weibull 


1. 557 
(0. 003) 


5. 082 
(0. 035) 


323. 012 
(0. 084) 


373, 912 
(0. 059) 


425. 816 
(0. 136) 


442. 749 
(0.172) 


496, 882 
(0. 182) 


553. 934 
(0. 328) 


634. 068 
(0. 472) 


700, 534 
(1.696) 



Table 1 



Table 1 shows the form chosen to tabulate the 
results (moments and quantiles) of a simulation 
involving m replications for each n. These 
results are averages and sample standard devia- 
tions of the results of the r blocks of m 1 
replications, all of this being stored in an 
archive which Tukey has aptly called the SIDEPUT. 
The estimated standard deviations of the estimates 
are given in brackets just below the estimates; 
below them we give (not shown) the estimated 
quantiles after subtraction of the estimated mean 
p and division by the estimated standard devia- 
tion a. This allows the experimenter to judge 
whether the statistic is approximately normally 
dis tributed . 

The last blocks in Figure 1 allow for CUMULATIVE 
TABULATION on n, EDITING and SMOOTHING of the 
results (including rounding and printing tables 
for publication) , and GRAPHICAL OUTPUT as shown 
in Figures 2 &3. It is easy to see in the figures 
that this statistic is not normally distributed 
and is converging very slowly with n to the 
asymptotic (n-* 50 ) distribution. The positive 



skewness of the distribution is also evident. 

An original, rather inefficient, COMPSTAT program 
was used to implement a study of tests of inde- 
pendence in point processes. Twenty statistics 
were computed simultaneously on an IBM 360/91 in 
a 120K partition. Some of these results have been 
published (Lewis, 1972) and are partially repro- 
duced here (Figures 2 and 3) ; others will appear 
later . 

A study on a similar scale of robust estimates of 
location was undertaken at Princeton (Andrews, 
et al, 1972); they had the advantage over me of 
both manpower and expertise. 

It is hoped to rewrite the COMPSTAT program at 
some later time in order to incorporate all of the 
recent advances in statistical computing technol- 
ogy described below. 

2. DETAILS 

We discuss now the details of the implementation 
of a program such as COMPSTAT. At its inception 
in 1966 we quickly ran up against the lack of real 



Figure 2 



computational considerations in many standard 
statistical procedures. The situation is better 
at present, with books such as Hemmerle (1967) 
and Knuth (1969) now available. Knuth (1969) in 
particular is invaluable. There are still many 
problems, however, particularly relating to large- 
scale computations. 

a) Random Number Generation 

Clearly the statistical quality of the random 
numbers available for large-scale simulations 
will be the limiting factor in how far one can go 
in utilizing large-scale computers in simulations. 
In 1966 the main generator in use was RANDU in 
the IBM SSP package. It is still widely used to- 
day, by default, even though it is known to 



knowledgeable users to have poor statistical pro- 
perties. There are no published test results on 
RANDU, except one brought to my attention at the 
conference (Bates and Zirkle, 1971) but there are 
papers published on problems which have been en- 
countered with its use. Moreover, as a statistical 
consultant one comes up against many cases in which 
strange results in simulations are remedied by re- 
placing RANDU by another random number generator. 

In this respect it might be noted that if statis- 
ticians are guilty of ignoring computational 
aspects of their procedures, computer scientists 
are equally guilty of ignoring the statistical 
aspects of algorithms. There are hundreds of 
clever random number algorithms in the literature 
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Figure 3 



[see the bibliography by Nance and Overstreet, 
1972] but there are virtually no accompanying test 
results published. Moreover, it is generally dif- 
ficult to do so in a computer journal. 

In 1966 we started to investigate the problem of 
random number generation for large-scale computa- 
tion, and after extensive testing we developed 
a 31-bit pseudo-random number generator for the 
System 360. This is a multiplicative congruential 
generator of the form (Lewis, Goodman, and Miller, 
1969) 

x i+1 = Ax i (mod p) , 

31 5 

where p, a prime, is 2 - 1 and A = 7 is a 



positive primitive root of p, thus guaranteeing 
a cycle of length p for the generator. Another 
advantage of using a positive primitive root for 
the multplier is that low order bits are also 
random. Beside the assembly language version given 
in the paper, a very fast version of this 
generator which generates arrays of integer or 
floating point numbers is available in the new IBM 
SL/MATH package. We will refer to this as the GGL 
generator; it generates a random number in 1.40 
ysecs on a 360/91 and in 16.00 ysecs on a 360/67. 

A version has been written for the 360/67 at the 
Naval Postgraduate School using a division simu- 
lation algorithm due to Lehmer (see Payne, Rabung, 
Bagyo , 1969; Liniger, 1961). 
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Test results for GGL are given in Lewis, Goodman, 
and Miller (1969) and extensive subsequent use 
turned up no obvious problems, though constant 
care was exercised. For example, in Table 1 the 
maximum periodogram values with 1/2-Weibull dis- 
tributed variates (squares of exponentially dis- 
tributed variates) are very large. In this simu- 
lation r = 6, m' = 750,000 and m = 4,500,000. 

As a check, normal deviates were used in a very 
large simulation and no discrepancy from the 
exact distribution (shown in Table 1) even at 
0.999 quantiles was found. This GGL random 
number generator was used in the Princeton study 
(Andrews, et al , 1972) and is used in APL. 

Nevertheless, valid doubts continue to be expres- 
sed about the use of the GGL generator in large- 
scale computations, particularly in light of 
Marsaglia's results (Marsaglia , 1968, 1972) on the 
structure of sequences of numbers from congruen- 
tial generators. These results have shed a lot 
of light on problems which can be encountered 
with congruential generators, but I don't be- 
lieve they tell the whole story. There are also 
new types of generators being advocated. Some 
of these are too cumbersome for consideration, 
but others are popular, in particular the Taus- 
worthe or shift register generators (Tausworthe, 
1965). Some doubt has been cast on the statis- 
tical properties of these shift register pseudo- 
random number generators recently; my own 
preference is, for speed and simplicity, to go to 
shuffled congruential generators (Marsaglia and 
Bray, 1968). Tukey (1972) ascribes this idea 
to Gentlemen but it seems quite old and was put 
forward by Marsaglia in the early 1960's. I 
have found no documentation of the statistical 
properties of shuffled generators, although they 
are intuitively appealing. 

We have undertaken further statistical tests of 
some of the above generators at the Naval Post- 
graduate School. In particular, we have been 
interested in correlating test results with 

* I am indebted to Dr. L. R. Turner, NASA 
for these numbers. 



results of Couveyou and MacPherson, 1967; (see 
also Knuth, 1969, pp. 82-100.) and Marsaglia, 1972. 
It seems to me that the Couveyou-MacPhe rson Fourier 
analysis is the best analytical tool for predicting 
performance of random number generators that has 
appeared. Marsaglia's results [1972] on the lat- 
tice structure of the congruential generators are 
also useful. 

The tests referred to started with the GGL random 
number generator, RANDU and a TAUSWORTHE generator 
(Tausworthe, 1965) and the runs test. As in Lewis, 
Goodman, and Miller (1969) , runs of length eight 
or longer are pooled and a Chi-square statistic 
computed. Nominally, this has a Chi-square dis- 
tribution with 7 degrees of freedom and we denote 
2 

it as x 7 * Table 2 gives summary statistics on 
16 

20 runs of 2 numbers each for the three gener- 
ators. The Tausworthe generator is now analytic- 
ally known to have poor runs performance (Toothill, 
Robinson, and Adams, 1971). The runs test rejects 
neither GGL nor RANDU if the test statistic is as- 
sumed to be distributed as a Chi-square variate 
with 7 degrees of freedom. As mentioned before, 
RANDU is known to be poor; this is shown in iable 
3 giving the Couveyou-MacPher son wave numbers for 
GGL and RANDU for dimensions up to 7. It is in 
higher dimensions that RANDU is particularly poor.* 



Table 2. Runs test; Chi-square statistic . 





X 7 




GGL 


6.846 


4.084 


RANDU 


7.939 


3.502 


TAUSWORTHE 


9.972 


10.428 


Table 3. Wave 


numbers for two 


generators . * 




GGL 


RANDU 


Dimension 






2 


16,807 


23,172 


3 


638.9 


10.86 


4 


146.25 


10.77 


5 


67.21 


10.77 


6 


29.92 




7 


16.55 




Lewis Research Center, Cleveland, 


Ohio , 
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A comparison of the three samples of size 20 using 
a two-sample Kolmogorov test rejects the hypothesis 
that any of them are from the same distribution! 

This is a sad result for large-scale simulation, 
particularly if one were trying to simulate the 
distribution of the Chi-square summary statistic, 

Xy, for the runs test. 

The three generators have been shuffled and the 

Chi-square statistics of 100 tests for samples 
16 

of size 2 numbers from each generator were 
found to be dis tributionally commensurate. The 
shuffled Tausworthe generator was still suspect, 
however . 

Results of this testing will be given elsewhere; 
other statistical tests are being evaluated. The 
conclusions so far are interesting. There is 
mild evidence that shuffling helps. The main con- 
clusion seems to be, however, that the runs test 
is virtually useless. (Note that Bates and Zirkle 
accept RANDU, partly on the basis of runs tests.) 

And although recent books (Newman and Odell, 1971; 
Maisel and Gnugnoli, 1972) tout the runs test as 
amongst the best, I have been unable to find any 
documentation for this. It seems to be an example 
of a stochastic rumor to which I too have contri- 
buted (Lewis, Goodman, and Miller, 1969). Perhaps 
some readers can guide me to work on the power 
of the runs test; Lehman (1959, p. 155) points 
out that a modified runs test has certain optimum 
properties in testing for independence in a binary 
sequence against lst-order Markov alternatives. 

Even results on the power of the runs test re- 
lative to the serial correlation test for first 
order normal autoregressive schemes would be of 
interest . 

There is clearly much work to be done in random 
number generation. I have also not mentioned 
the need for efficiently generated, reliable nor- 
mally distributed random variates. These, and 
perhaps several other kinds of random deviates 
should be provided as primitives, just as random 
numbers are provided as primitives in APL. A 
package to generate normally and exponentially 
distributed random variables is available from 
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Marsaglia at McGill University. It uses some of 
Marsaglia’s own methods and is very fast. A survey 
of some of these methods is given by Ahrens and 
Dieter (1972). 

b) Ordering . 

Ordering (sorting) 0 f quantities, or obtaining 
ranks, is a basic operation in statistical compu- 
tation; a survey is given by Martin (1971). The 
main use we had for it initially was in quantile 
estimation, and here it was a bottleneck since, 
in general, ordering of n quantities takes a 
number of operations proportional to n(£n n) and 
memory capacity proportional to n. The quantile 
estimation problem is discussed below; the use of 
ordering is now used mainly in COMPSTAT in gener- 
ating statistics such as the median. There are 
several points to be made here. 

(i) Uniformly distributed random variates 
can be ordered by address modification schemes 
(Isaac and Singleton, 1956) in time propor- 
tional to n, although for large computations 
3n memory positions are needed to avoid over- 
flows. An algorithm for this type of sorting 
is provided in COMPSTAT. 

(ii) It is clear that by using pilot estimates 
of a non-uniform distribution, address modi- 
fication schemes can be used on any data. 

These take, asymptotically, n operations, 
but for reasonable sample sizes the procedure 
is slow and cumbersome programming- wise . The 
scheme is due to Floyd at Stanford. There is 
renewed interest in this area and Chambers 
(1971) has a scheme for partial sorting which 
is more efficient than an n(£n n) sort. 
Andrews (personal communication) also has a 
scheme for obtaining the median; it uses a 
pilot estimation scheme and is subject to 
overflows which could be a problem in large- 
scale simulations. 

(iii) Schemes using the Markov property of 
the gaps (differences between successive order 
statistics) (see David, 1971, p. 17) are 
available for producing ordered uniform var- 
iates (Schucany, 1972; Lurie and Hartley, 1972) 



We had tried in COMPSTAT an equivalent scheme 
based on the independence of the gap statis- 
tics for exponentially distributed variates. 
These schemes for moderate n are more time 
consuming than the n(log n) schemes, but are 
more efficient in use of memory space. Their 
primary use would seem to be when only a f ew 
of the low or high order statistics are needed. 
Two points should be made here: 



1. It is computationally easier to generate 
high order statistics, rather than the low 
order statistics advocated by Lurie and 
Hartley (1972) and Shucany (1972). Denoting 
the uniform variates by tL and the ordered 
uniform variates by U^, we have 



rob 



{ u (d * u a 



i> |U i+i u (i+uj 



1 _ 


u (i) 1 


} 1 


U (i+i) ’ 



(i 1,2,. ..n; u (i) <; u (±+1) , = 1) 



If only low order uniform order statistics are 
required, they are generated as = 1 - 

U (n+l-i) ' 



2. The time consuming operation in the above 
is to take the (l/i)th power. This is done, 
usually, using logarithms and the scheme is 
then equivalent to generating order statistics 
from a unit exponential distribution. However, 
since it is much f as ter to generate exponen - 
tial variates using some of Marsaglia ’ s sam - 
pling procedures than it is to generate them 
by taking the logarithm of _a uniform variate , 
it is faster to generate ordered uniform ran- 
dom numbers by starting wi th exponential var - 
iates . 

The basis for this is that if E^, i = 1» 2, 
...n, denotes ordered unit exponential vari- 
ates from a sample of size n, and we let 

= 0, the gap statistics (Cox and Lewis, 
p. 26-27) 

D (i) = E (i) " E (i+1) (l = n) 

are independent exponentials with mean 
E(D(i)) = (n+l-i) 1. Thus if we have n unit 
exponentials, generated say by one of Marsag- 
lia’s schemes, we generate 



i 



E “’ V 


(n+l-j) E. 
1 J 


(i * 1, .. 


• n) 


and 








U (i) = 1 


- exp <E (i) ; 


(i = 1, .. 


.n) . 


An n(log n) 


sorting and 


ranking scheme 


is also 



provided in COMPSTAT. for sorting and ordering with- 
in the STATISTICS GENERATOR. 

c) Quantiles and Percentiles . 

Estimating quantiles was the second biggest bottle- 
neck in implementing COMPSTAT. Quantiles are more 
basic in characterizing distributions than percen- 
tiles, although, for example, one is interested in 
percentiles when evaluating by simulation the power 
of a test based on a statistic S. Thus, given 
the a-quantile x^ of S under a null hypothesis, 
one wants the percentile corresponding to S and 
x^ under a different hypothesis. 

Percentile estimation as a binomial process is es- 
sentially straightforward and ideal by our criter- 
ion of simplicity and economy of computation and 
memory requirements. It is also unbiased. However, 
it appears that greater efficiency should be ob- 
tained by coupling estimates at different x^’s, 
although I haven’t been able to do so. Most schemes 
appear to require assumptions about boundedness of 
the probability density function. Somerville (1970) 
has some results in this area; it appears to be ar. 
area for further research. 

Quantile estimation based on order statistics is 
advocated in most texts (see David, 1971'). For 
large-scale computation the sorting time required 
and the memory capacity is prohibitive. Stochastic 
approximation schemes (Robbins and Monro, 1951; 
Hodges and Lehman, 1956) were then tried but found 
to converge at an impossibly slow rate for large 
quantiles. These two quantile estimation schemes 
are prime examples of statistical procedures which 
are not attuned to computing realities, and whose 
asymptotic properties are deceptive as far as prac- 
tical applications are concerned. 

A solution was finally found (Goodman, Lewis, and 
Robbins, 1972) which combined the stochastic ap- 
proximation with a data transformation. Typically 
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if the a-quantile was required (a>0.5), the 
maxima of successive groups of size v of reali- 
zations of S are found. The problem is then 
one, if a’ = a V , of finding the x^, quantile, 

which is equal to x , in a distribution which 
th ^ 

is the v— power of the distribution S, F (x) . 
By taking v large enough to make a* ~ 1/2 the 
problem becomes one of estimating a median, al- 
though other values of v can be used. Stochastic 

approximations work well with medians, but as the 

- 1/2 

bias is apparently of order m , jackknifing 
is required to reduce the bias. 

The present scheme (Goodman, Lewis, and Robbins, 
1972) based on the maximum transformation and 
stochastic approximation solves the basic pro- 
blems of quantile estimation, but research is con- 
tinuing to improve it. Computationally it is very 
good since finding a maximum requires only two 
memory cells and computation time is linear in m, 
the number of realizations of S which are gen- 
erated . It is also simple to compute in parallel 
the quantiles for several levels, e.g. a = 

0.990, 0.995, 0.999. 

D. Salsburg has raised the question as to whether 
one wouldn’t want to order the data anyway to do, 
for instance, a normal probability plot of the 
simulated distribution. This may be true for 
samples of size m equal to about 500; beyond 
that the sorting in a large-scale simulation be- 
comes onerous, time-wise and memory -wise, and I 
feel a plot using 16 quantiles, as in Figure 2, 
plus the moments in Figure 3, is as good as or 
better than a full probability plot. 

d ) Bi a s and Bias Reduction . 

It is essential for sensible and interpretable 
simulation results to have estimates of the var- 
iances of the simulated quantities. However, 
sectioning the m replications in a large-scale 
simulation into r sections of m’ replications 
to estimate the variance of estimates (see 
Mosteller and Tukey , 1968) brings in problems of 
bias. This is because one wants r to be about 
10 to get reliable estimates of the variance, but 
the resulting m f may be too small to reduce the 



bias in the simulated quantity to acceptable levels. 

This problem seems to be well in hand because of 
the jackknife technique for bias reduction which 
was developed by Quenouille (1956) , pushed by 
Tukey (1958) and generalized by Schucany, Gray, 
and Owen (1971) and Gray and Schucany (1972). A 
similar technique was used by Gaver and Hoel (1970) 
in examining small-sample Poisson probability es- 
timates. Some price may be paid in inflation of 
the variance of the estimator (Miller, 1964; also 
Goodman, Lewis, and Robbins, 1972, for a specific 
case) . 

In COMPSTAT the jackknife is quite simply incor- 
porated into the STATISTICS GENERATOR. 

e) Variance Estimation . 

The problem of bias appears to have been alleviated 
directly by the jackknife, and indirectly because 
of a suggestion by Tukey (1958) that the sample 
standard deviation based on the pseudo-values in 
the jackknife procedure be used to estimate the 
variance of the jackknifed estimate. There is 
some evidence that this procedure is broadly 
applicable, although Miller (1968) pointed out 
cases where it can give poor results. In general, 
n-fold jackknifing in a small sample of size n 
can give an estimate with a very inflated variance, 
though this problem disappears as n-*°°. Relevant 
references are Arvesen (1969), a review by Arvesen 
and Salsburg (1972), and Mosteller and Tukey (1968). 

The -jackknifing procedure will probably be most 
useful when available computation time is too short 
for sectioning. For a description of variance 
estimation techniques based on sectioning, see 
Mosteller and Tukey (1968) . 

f ) Variance Reduction Techniques . 

I have not discussed variance reduction techniques 
so far. An excellent review is given by Gaver 
(1969); see also Hammersley and Handscomb (1964). 
These variance reduction techniques can be imple- 
mented in COMPSTAT but there seem to be several 
drawbacks, mainly that the methods are particular 
to the problems at hand. Thus, a large amount of 
time can be spend deriving, say, an antithetic 
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variable estimate is 



variate for a particular problem and this may, when 
large computers are available, be an inefficient 
way to use statisticians. 

The most important drawback to most methods, how- 
ever, is that a method that reduces the variance 
of an estimate of the mean of a statistic S will 
often inflate the variance of an estimate of the 
variance of S. This is clearly true for many 
antithetic variate techniques (Hammersley and 
Mauldon, 1956) and would be worse when quantiles 
or percentiles are also required. This may be 
all right in nearly normal situations, but not in 
others . 

Much more research is required on variance reduction 
techniques that are applicable to all aspects of 
the characterization of a distribution, and are 
easily derived. Control variable techniques 
(Fieller and Hartley, 1959) seem to me the best 
candidate for this role. 

An empirical control variable technique can be im- 
plemented with COMPSTAT when exploration is re- 
quired around a null situation. This may, for 
instance, be a test of hypothesis in which power 
against small deviations is of interest. Again, 
small variations in scheduling algorithms in com- 
plex queues might be of interest to see what 
improvement they make to, say, throughput time. 

One might then do a very precise simulation of the 
characterizations of the statistic under the null 
hypothesis. Fix m f , the number of replications 
per section, and let r, the number of sections, 
be large and denote by t^(r) the estimated 
quantity under the null hypothesis. This will be 
the average of the estimates of 6^ from the r 
sections. Results for the sections are kept in 
the SIDEPUT, together with the seed for the random 
number generator which initiates each section of 
the simulation. The quantity is estimated under 
alternative conditions using the same random num- 
bers using only r f sections, where r f << r. 

Call this quantity ©^(r 1 ). If ^(r f ) i s the 
null (average) estimate from the first r’ sec- 
tions, f^(r-r f ) the null (average) estimate from 
the last r - r T sections, then the control 



e £ (r') = e £ (r’) - e 0 (r') + e 0 (r) 

= V r ’> - (£z r> V r,) + (£ t-> V r_r,) - 

Then _ t 

var[ 6 £ ( r') ] = var [ 6 £ (r’) ] + (-^) 2 var[ r Q (r') ] - 

(•“-) cov [ 0 £ (r ' ) 0 o (r') ) + ( Iz ^-)var[ ^(r-r') ] . 



The common random numbers used to generate the es- 
timates should make the estimates 6 £ (r f ) and 
£g(r f ) highly correlated, and the above equation 
is the variance in the usual control variable sit- 
uation except for the last tern'. If r is large 
relative to r f this last term should be small 
relative to the other terms. 



It is possible to use subsequent sections of size 
r f in the original simulation of to explore 

other alternatives, say 6. , £ , .... There are 

ti e 2 

interesting design and analysis problems in this 
scheme which will be explored elsewhere. 



One final point should be made here about control 
variables. Let G be the uncontrolled estimate 
and £ the controlled estimate (generated from 
the same random numbers). It is not often real- 
ized that even with a regression adjusted control 
(see Gaver, 1969) the maximum attainable variance 
reduction is 



var ( 0 ) _ . 
var( 5) 



where P is the correlation between 6 and G. 

It can be very difficult and time-consuming, es- 
pecially for the inexperienced practitioner, to 
find a control which gives a high enough p to 
justify the pratitioners time. And in many cases 
equivalent speed ups can be achieved by using more 
efficient random number generators, ordering rou- 
tines, etc. 



The time factor to achieve a high p is one rea- 
son for putting forward the empirical scheme above. 



g) Planning Simulation Experiments . 

The empirical control variable suggestion in the 
previous section brings up the whole question of 
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the design of simulation experiments. Thus, it 
would be reasonable to use the empirical scheme, 
or plan an experiment around the null value C^? 
This would be appropriate if the range of para- 
meters of interest were known in advance. The 
empirical control variable technique seems at- 
tractive as an on-line, interactive procedure, 
especially when estimates of the variances of the 
estimates are available, as in COMPSTAT. Some 
formal analysis is still needed and this could 
be formidable. 

In general, it would seem that the output of 
large-scale simulation would be a fertile field 
for application of techniques of analysis of var- 
iance and experimental design. I am not familiar 
with much by way of specific applications; several 
recent books, including that by Mihram (1972), 
which I have not examined carefully, do treat 
analysis of simulation experiments. The tendency, 
however, does seem to be to just regurgitate the 
old theory without specifically worrying about 
particular problems of simulation experiments. 

A simple case occurs when an experimenter has two 
variance reduction techniques available, say two 
control variables, and a fixed number of repli- 
cations m he can perform. He wants to choose 
the control variable which minimizes the variance 
of the final estimate of a parameter, say G, 
which could be the mean of a statistic S. If 
m’ is large enough so that the estimates in each 
of the r sections (rm ,= ni) are unbiased and 
normally distributed, this is a classical two arm 
bandit problem. 

I know of no one, however, who has actually done 
this, probably because the benefit of reduced 
variance doesn't outweigh the extra cost of tool- 
ing up for two estimates of 6. It could be 
feasible with COMPSTAT. Once more than one para- 
meter is involved, say the mean and variance of 
S, the problem is much more complicated. In 
general I think, however, that as computers de- 
velop simulation will make many statistical prac- 
tices developed in vacuo widely useful. 
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seen by considering Figure 2, where estimated quan- 
tiles of a distribution are plotted. One would 
generally want to smooth these plots or fit some 
regression function to assess the rate of conver- 
gence to the asymptotic normal distribution. There 
are problems in that the number of simulations, m, 
was fixed in advance and thus, the variances at 
each n vary. Moreover, one would want to couple 
the smoothing or regression analysis of the var- 
ious quantiles. These are both functionally and 
statistically correlated for each n across 
quantiles and with n for each quantile. 

Detailed analysis of such graphic output needs 
much more work; it is possible that the work of 
Efron and Morris (1972) may be relevant to this 
problem . 

besides the smoothing, any program such as COMPSTAT 
should provide facility for direct plotting of 
output tables of rounded and perhaps smoothed data. 
This is one facility computer scientists can pro- 
vide us with. 

3. MISCELLANEOUS PROBLEMS AND OPEN QUESTIONS. 

I have not touched on many questions in large-scale 
simulation. A few are discussed here to emphasize 
that there are many problems that do not even start 
to fit on present or future computers. Thus, sim- 
ulation, especially without some analytic support, 
is not always a possible way out of problems, al- 
though some people feel simulation is the last 
resort. Other questions discussed below indicate 
that there are simple problems we cannot handle. 

a) Conditional distributions. 

Conditioning poses problems in simulations which 
I do not know how to brand le efficiently. Thus, in 
fitting exponential polynomials to data from a non- 
homogeneous Poisson process (Lewis, 1972) observed 
for a time t^, one wants to condition on the 
number, n, of events observed in (0,t^). The 
times to events t. are then order statistics 

l 

from a uniform random sample of size n. In test- 
ing for a second order term in the polynomial one 

2 

wants the conditional distribution of Et^, given 
n and Et^. Conceptually this is simple to see, 



Some of many other open design problems can be 



